library(tidyverse, verbose = FALSE)
library(tidymodels, verbose = FALSE)
library(reticulate)
library(ggplot2)
library(plotly)
library(RColorBrewer)
library(bslib)
library(Metrics)
reticulate::use_virtualenv("r-tf")
Simone Brazzi
August 2, 2024
At prediction time, the model must return a vector containing a 1 or a 0 for each label present in the dataset (toxic, severe_toxic, obscene, threat, insult, identity_hate). This way, a harmless comment will be classified by a vector of all 0s [0,0,0,0,0,0]. Conversely, a harmful comment will have at least one 1 among the 6 labels.
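This output contract can be sketched in a few lines of NumPy. The 0.5 threshold and the probability values below are illustrative assumptions, not the tuned threshold derived later in the post:

```python
import numpy as np

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def to_label_vector(probabilities, threshold=0.5):
    """Turn per-label sigmoid probabilities into the 0/1 vector described above."""
    return (np.asarray(probabilities) >= threshold).astype(int)

clean = to_label_vector([0.02, 0.01, 0.03, 0.00, 0.02, 0.01])    # all labels off
harmful = to_label_vector([0.91, 0.12, 0.77, 0.05, 0.64, 0.08])  # some labels on
```

A comment is then "harmful" whenever the resulting vector contains at least one 1.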
Leveraging Quarto and RStudio, I will set up a combined R and Python environment.
Import the R libraries. These will be used both for rendering the document and for data analysis, since I prefer ggplot2 over matplotlib. I will also use colorblind-safe palettes.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import keras
import keras_nlp
from keras.backend import clear_session
from keras.models import Model, load_model
from keras.layers import TextVectorization, Input, Dense, Embedding, Dropout, GlobalAveragePooling1D, LSTM, Bidirectional, GlobalMaxPool1D, Flatten, Attention
from keras.metrics import Precision, Recall, AUC, SensitivityAtSpecificity, SpecificityAtSensitivity, F1Score
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import multilabel_confusion_matrix, classification_report, ConfusionMatrixDisplay, precision_recall_curve, f1_score, recall_score, roc_auc_score
Create a Config class to store all the useful parameters for the model and the project.
I created a class with all the basic configuration of the model to improve readability.
class Config():
    def __init__(self):
        self.url = "https://s3.eu-west-3.amazonaws.com/profession.ai/datasets/Filter_Toxic_Comments_dataset.csv"
        self.max_tokens = 20000
        self.output_sequence_length = 911  # check the analysis done to establish this value
        self.embedding_dim = 128
        self.batch_size = 32
        self.epochs = 100
        self.temp_split = 0.3
        self.test_split = 0.5
        self.random_state = 42
        self.total_samples = 159571  # total train samples
        self.train_samples = 111699
        self.val_samples = 23936
        self.features = 'comment_text'
        self.labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
        self.new_labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate', "clean"]
        self.label_mapping = {label: i for i, label in enumerate(self.labels)}
        self.new_label_mapping = {label: i for i, label in enumerate(self.new_labels)}
        self.path = "/Users/simonebrazzi/R/blog/posts/toxic_comment_filter/history/f1score/"
        self.model = self.path + "model_f1.keras"
        self.checkpoint = self.path + "checkpoint.lstm_model_f1.keras"
        self.history = self.path + "lstm_model_f1.xlsx"
        self.metrics = [
            Precision(name='precision'),
            Recall(name='recall'),
            AUC(name='auc', multi_label=True, num_labels=len(self.labels)),
            F1Score(name="f1", average="macro")
        ]

    def get_early_stopping(self):
        early_stopping = keras.callbacks.EarlyStopping(
            monitor="val_f1",  # "val_recall"
            min_delta=0.2,
            patience=10,
            verbose=0,
            mode="max",
            restore_best_weights=True,
            start_from_epoch=3
        )
        return early_stopping

    def get_model_checkpoint(self, filepath):
        model_checkpoint = keras.callbacks.ModelCheckpoint(
            filepath=filepath,
            monitor="val_f1",  # "val_recall"
            verbose=0,
            save_best_only=True,
            save_weights_only=False,
            mode="max",
            save_freq="epoch"
        )
        return model_checkpoint

    def find_optimal_threshold_cv(self, ytrue, yproba, metric, thresholds=np.arange(.05, .35, .05), n_splits=7):
        # instantiate KFold
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=self.random_state)
        threshold_scores = []
        for threshold in thresholds:
            cv_scores = []
            for train_index, val_index in kf.split(ytrue):
                ytrue_val = ytrue[val_index]
                yproba_val = yproba[val_index]
                ypred_val = (yproba_val >= threshold).astype(int)
                score = metric(ytrue_val, ypred_val, average="macro")
                cv_scores.append(score)
            mean_score = np.mean(cv_scores)
            threshold_scores.append((threshold, mean_score))
        # find the threshold with the highest mean score
        best_threshold, best_score = max(threshold_scores, key=lambda x: x[1])
        return best_threshold, best_score

config = Config()
The dataset is accessible using tf.keras.utils.get_file to get the file from the url. N.B. For reproducibility purposes, I also downloaded the dataset: there were times when the link was not available.
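The threshold search in find_optimal_threshold_cv can be exercised on synthetic data before any model exists. The standalone sketch below mirrors its logic; the synthetic labels and probabilities are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold

def find_optimal_threshold_cv(ytrue, yproba, metric, thresholds=np.arange(.05, .35, .05), n_splits=7):
    # same logic as the Config method: score each candidate threshold across K folds
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    threshold_scores = []
    for threshold in thresholds:
        cv_scores = []
        for _, val_index in kf.split(ytrue):
            ypred_val = (yproba[val_index] >= threshold).astype(int)
            cv_scores.append(metric(ytrue[val_index], ypred_val, average="macro"))
        threshold_scores.append((threshold, np.mean(cv_scores)))
    # keep the threshold with the highest mean score
    return max(threshold_scores, key=lambda x: x[1])

rng = np.random.default_rng(42)
ytrue = rng.integers(0, 2, size=(500, 6))
# noisy probabilities correlated with the true labels
yproba = np.clip(ytrue * 0.6 + rng.uniform(0, 0.4, size=(500, 6)), 0, 1)
best_threshold, best_score = find_optimal_threshold_cv(ytrue, yproba, f1_score)
```

With these synthetic probabilities, positives always score at least 0.6, so the largest candidate threshold wins; on real model outputs the optimum depends on the probability distribution.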
# A tibble: 5 × 8
comment_text toxic severe_toxic obscene threat insult identity_hate
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 "Explanation\nWhy the … 0 0 0 0 0 0
2 "D'aww! He matches thi… 0 0 0 0 0 0
3 "Hey man, I'm really n… 0 0 0 0 0 0
4 "\"\nMore\nI can't mak… 0 0 0 0 0 0
5 "You, sir, are my hero… 0 0 0 0 0 0
# ℹ 1 more variable: sum_injurious <dbl>
Let's create a clean variable for EDA purposes: I want to visually see how many observations are clean versus the other labels.
First, a check on the dataset to find possible missing values and imbalances.
library(reticulate)
df_r <- py$df
new_labels_r <- py$config$new_labels
df_r_grouped <- df_r %>%
select(all_of(new_labels_r)) %>%
pivot_longer(
cols = all_of(new_labels_r),
names_to = "label",
values_to = "value"
) %>%
group_by(label) %>%
summarise(count = sum(value)) %>%
mutate(freq = round(count / sum(count), 4))
df_r_grouped
# A tibble: 7 × 3
label count freq
<chr> <dbl> <dbl>
1 clean 143346 0.803
2 identity_hate 1405 0.0079
3 insult 7877 0.0441
4 obscene 8449 0.0473
5 severe_toxic 1595 0.0089
6 threat 478 0.0027
7 toxic 15294 0.0857
library(reticulate)
barchart <- df_r_grouped %>%
ggplot(aes(x = reorder(label, count), y = count, fill = label)) +
geom_col() +
labs(
x = "Labels",
y = "Count"
) +
# sort bars in descending order
scale_x_discrete(limits = df_r_grouped$label[order(df_r_grouped$count, decreasing = TRUE)]) +
scale_fill_brewer(type = "seq", palette = "RdYlBu")
ggplotly(barchart)
It is visible how imbalanced the dataset is. This means it could be useful to compute class weights and use them during training.
It is clear that most of our comments are clean: 0.8033 of the observations are clean, and only 0.1967 are toxic comments.
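As a sketch of how such class weights could be derived from the label counts (this helper is hypothetical, not part of the notebook; for a multilabel problem the weights would typically be folded into a weighted loss rather than passed as the Keras class_weight argument):

```python
import numpy as np

def compute_class_weights(y):
    """Balanced-style per-label positive weights: n_samples / (2 * n_positive),
    so rare labels get proportionally larger weights."""
    y = np.asarray(y)
    n_samples = y.shape[0]
    positives = y.sum(axis=0)
    return n_samples / (2.0 * positives)

# toy label matrix: label 0 is common, label 1 is rare
y = np.array([[1, 0], [1, 0], [1, 1], [0, 0]])
weights = compute_class_weights(y)  # the rare label gets the larger weight
```

On this dataset, a label like threat (478 positives) would receive a far larger weight than toxic (15294 positives).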
To convert the text into an input a NN can use, it is necessary to use a TextVectorization layer. See Section 4.
One of its parameters is output_sequence_length: to choose a sensible value, it is useful to analyze the text lengths. To simulate what the model will do, we remove the punctuation and the new lines from the comments.
# A tibble: 1 × 6
Min. `1st Qu.` Median Mean `3rd Qu.` Max.
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 91 196 378. 419 5000
library(reticulate)
boxplot <- df_r %>%
mutate(
comment_text_clean = comment_text %>%
tolower() %>%
str_remove_all("[[:punct:]]") %>%
str_replace_all("\n", " "),
text_length = comment_text_clean %>% str_count()
) %>%
# pull(text_length) %>%
ggplot(aes(y = text_length)) +
geom_boxplot() +
theme_minimal()
ggplotly(boxplot)
library(reticulate)
df_ <- df_r %>%
mutate(
comment_text_clean = comment_text %>%
tolower() %>%
str_remove_all("[[:punct:]]") %>%
str_replace_all("\n", " "),
text_length = comment_text_clean %>% str_count()
)
Q1 <- quantile(df_$text_length, 0.25)
Q3 <- quantile(df_$text_length, 0.75)
IQR <- Q3 - Q1
upper_fence <- as.integer(Q3 + 1.5 * IQR)
histogram <- df_ %>%
ggplot(aes(x = text_length)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = upper_fence), color = "red", linetype = "dashed", linewidth = 1) +
theme_minimal() +
xlab("Text Length") +
ylab("Frequency") +
xlim(0, max(df_$text_length, upper_fence))
ggplotly(histogram)
Considering all the above analysis, I think a good starting value for output_sequence_length is 911, the upper fence of the boxplot (the dashed red vertical line in the last plot). Doing so, we are removing the outliers, which are a small part of our dataset.
Now we can split the dataset into 3 sets: train, test and validation. Since sklearn has no function that performs a three-way split directly, we can do the following:
- split into a train set and a temporary set with a 0.3 split;
- split the temporary set into two equally sized test and validation sets.
x = df[config.features].values
y = df[config.labels].values
xtrain, xtemp, ytrain, ytemp = train_test_split(
x,
y,
test_size=config.temp_split, # .3
random_state=config.random_state
)
xtest, xval, ytest, yval = train_test_split(
xtemp,
ytemp,
test_size=config.test_split, # .5
random_state=config.random_state
)
xtrain shape: py$xtrain.shape
ytrain shape: py$ytrain.shape
xtest shape: py$xtest.shape
ytest shape: py$ytest.shape
xval shape: py$xval.shape
yval shape: py$yval.shape
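The two-stage split above yields roughly 70/15/15 of the total samples. A quick sanity check of the sizes on a toy array (sizes only, no real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1000)
# stage 1: 70% train, 30% temporary
xtrain, xtemp = train_test_split(x, test_size=0.3, random_state=42)
# stage 2: split the temporary set in half -> 15% test, 15% validation
xtest, xval = train_test_split(xtemp, test_size=0.5, random_state=42)
```

With 159571 total samples this gives about 111700 train and roughly 23936 each for test and validation, matching the counts stored in Config.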
The datasets are created using the tf.data.Dataset API, which builds a data input pipeline. The tf.data API makes it possible to handle large amounts of data, read from different data formats, and perform complex transformations. A tf.data.Dataset is an abstraction that represents a sequence of elements, in which each element consists of one or more components. Here each dataset is created using from_tensor_slices, which builds a tf.data.Dataset from a tuple (features, labels). .batch lets us work in batches to improve performance, while .prefetch overlaps the preprocessing and model execution of a training step: while the model is executing training step s, the input pipeline is reading the data for step s+1. Check the documentation for further information.
train_ds = (
tf.data.Dataset
.from_tensor_slices((xtrain, ytrain))
.shuffle(xtrain.shape[0])
.batch(config.batch_size)
.prefetch(tf.data.experimental.AUTOTUNE)
)
test_ds = (
tf.data.Dataset
.from_tensor_slices((xtest, ytest))
.batch(config.batch_size)
.prefetch(tf.data.experimental.AUTOTUNE)
)
val_ds = (
tf.data.Dataset
.from_tensor_slices((xval, yval))
.batch(config.batch_size)
.prefetch(tf.data.experimental.AUTOTUNE)
)
train_ds cardinality: 3491
val_ds cardinality: 748
test_ds cardinality: 748
Check the first element of the dataset to be sure that the preprocessing is done correctly.
(array([b"I've read through the months-long discussion regarding the introduction, and seen some very helpful proposals from each of the main participants. I'd like to plead for at least a conservative interim revision of the unreadable and uninterpretable paragraph that is there now. By conservative, I mean to retaining the existing terminology, but to open with a positive declarative sentence without asides, exceptions and negatives. Hammer out the exact terminology later, and other contentious issues in continued discussion. That change will at least allow the visitor to orient on first arrival. Thanks!",
b'In addition, you may want to read this as it is explained by author Mike Campbell:\n\nThe TIGHAR-Nikumaroro Fiasco\nTwo years of speculation and hype came to a numbing climax on\nMarch 16, 1992, at the National Press Club in Washington,\nD.C. At a crowded news conference receiving national TV\ncoverage, Richard Gillespie, executive director of The International\nGroup for Historic Aircraft Recovery [TIGHAR], unabashedly\nannounced that the Amelia Earhart mystery \xe2\x80\x9cis solved.\xe2\x80\x9d\nSince early 1990, following the group\xe2\x80\x99s first excursion to Nikumaroro,\na 3.5-mile long coral atoll in the Phoenix Islands, about 2,100 miles\nsouthwest of Hawaii and three hundred miles south of Howland Island,\nthe national media had touted TIGHAR\xe2\x80\x99s theory that Nikumaroro could\nbe the final resting place of Amelia Earhart and Fred Noonan.\nThe \xe2\x80\x9cevidence\xe2\x80\x9d Gillespie presented included a battered piece of aluminum,\na weathered size 9 shoe sole labeled \xe2\x80\x9cCat\xe2\x80\x99s Paw Rubber Co.,\nUSA,\xe2\x80\x9d a small brass eyelet, and another unlabeled heel the group found\non Nikumaroro during TIGHAR\xe2\x80\x99s highly publicized return there in\nOctober 1991. These items, elaborately displayed and labeled in a glass\ncase, all came from Earhart or her Electra, according to Gillespie.\n\xe2\x80\x9cThere may be conflicting opinions, but there is no conflicting evidence,\xe2\x80\x9d\nGillespie said. \xe2\x80\x9cI submit that the case is solved.\xe2\x80\x9d Gillespie then\nbit his lip and looked down at the floor \xe2\x80\x94 a curious yet revealing display\nof body language for someone claiming to have solved one of the\n20th century\xe2\x80\x99s greatest mysteries.\nGillespie and Patricia R. Thrasher, TIGHAR\xe2\x80\x99s husband-and-wife\nteam, theorize that through a navigational error, Earhart and Noonan\noverflew Howland Island and landed on a reef on Nikumaroro during\nlow tide. 
There they died of dehydration a short time later. The plane,\nthey said, was washed over the reef when the tide came in and now lies\nunder 2,000 feet of ocean.\nThe April 1992 Life magazine featured a six-page spread penned by\nGillespie, who cites Navy pilot John Lambrecht\xe2\x80\x99s July 9, 1937 report:\n\xe2\x80\x9cSigns of recent human habitation were clearly visible [on\nNikumaroro], but \xe2\x80\x9crepeated circling and zooming failed to elicit any\nanswering wave from possible inhabitants.\xe2\x80\x9d Gillespie called this report\n\xe2\x80\x9chugely significant\xe2\x80\x9d and \xe2\x80\x9ctragically inadequate. What had not been\ndone in 1937 had to be done now. We would have to search\nNikumaroro.\xe2\x80\x9d\nGillespie credited two retired military aviators, Tom Gannon and\nTom Willi, of Fort Walton Beach, Florida, for the theory\xe2\x80\x99s origin.\n\xe2\x80\x9cUsing celestial tables, Gannon pointed out that on the morning of July\n2, 1937, the rising sun would have provided the precise line of position\nEarhart said she was running,\xe2\x80\x9d Gillespie wrote in Life. \xe2\x80\x9cBy flying\nsoutheast along that line, Noonan could be sure that, even if he missed\nHowland, he would reach an island in the Phoenix group in about two\nhours. Clearly it was the safest, sanest course to follow. I traced the line\non the chart and read the name of the island: Nikumaroro.\xe2\x80\x9d\nThis supposition was convincing to some, but Devine wasn\xe2\x80\x99t\nimpressed. \xe2\x80\x9cThere is considerable information in Gillespie\xe2\x80\x99s rendition that\ncan be faulted,\xe2\x80\x9d Devine writes. \xe2\x80\x9cOne example is his attempt to foster a line\nbetween Howland Island and Nikumaroro. He presents Earhart\xe2\x80\x99s last message\nas \xe2\x80\x98We are on line 157/337 ... We are running on line.\xe2\x80\x99\nEarhart never reached Howland Island. 
Earhart never\nsaw Howland Island, therefore Howland Island is not a reference\npoint for a landing on Nikumaroro. Earhart\xe2\x80\x99s authentic\nlast message received by Coast Guard personnel on board\nthe Itasca, and recorded by Commander Thompson, related\nno reference point: \xe2\x80\x98We are on line of position 157-337 ...\nWe are running north and south.\xe2\x80\x99 Earhart was not flying east\ntoward Howland Island or Hawaii. She was on a course, a\ncompass course or line of position 157-337, which terminates\non the island of Saipan.1\nWith Our Own Eyes\n\xe2\x80\x93 163 \xe2\x80\x93\n\xe2\x80\x93 164 \xe2\x80\x93\nEarhart was not adept at navigation, and may have stated\nher reading from a pocket compass she carried on her person.\nShe was probably and almost certainly stressed by ensuing\nevents, especially if Noonan had been rendered unconscious\nduring the almost disastrous takeoff of the overladen plane.\nIf Noonan had not been incapacitated, I assure you that he,\none of the best air-and-sea navigators of that period, would\nhave been communicating. He would have guided their\nplane on a pre-arranged course, and calculated for accuracy\nduring intervals of the flight.2\nDevine writes that another likely origin of Gill',
b'It is 510A , dimwit, the persecuted heroine, a scientific category, recognized by everyone except you. The sources are all there in the Wiki links. And nobody says that the Chinese version must come from the Greek one, all it says is that the Greek was an earlier Cinderella version. Since Windling has based her argument of the oriental origin on the temporal priority of Ye, the existence of the much earlier Greek version, shows her to be wrong. Either we include a part on Rhodopis or we exclude the misinformed Windling reserach. You choose. With your destructive approach, what are you doing anyway at Wikipedia? Regards',
b'"\n\nMy apologies, I thought when you mentioned you were related to the ""Jewish Giant"" you were ironically referring to Dianne Arbus herself, since it is to the JG himself you are related to my point still holds, as someone of at least partial Jewish ancestry why are you aiding the anti semites of the world?\n\nAs to my edit, you totally fail to answer the substance of my point: what is the point of identifying Noam Chomsky as ""jewish"" other than to lend legitimacy to his anti Jewish stance? And if that description stands, why is it a pov reference to indicate that while he may be of Jewish ancestry, he does not identify ""jewishly"" in any way. Look forward to your non ironical answer. "',
b"THEIR GOING TO BATH \n\nBUT THEY'RE A MUSLIM AND PAKI SO WHAT DO YOU EXPECT? 86.178.49.98",
b'That would make it the flag of the team using it, not the flag of the country. As with all national flag articles, they are about national flags, which the UB is not any more.',
b'Some of the 2010 movie My Lai Four by the italian producer Gianni Paolucci can be viewed on You Tube. This film looks like trash. If it is to be mentioned on this page then another movie about My Lai, by a Vietnamese film maker, Le Dan, also released last year, should also be mentioned. This is a film romanticization about William Calley, leader of the first platoon into My Lai.',
b'And this is why you should keep your watchlist nuked for a bit (assuming that T Canens meant what I think he meant, and you avoid the ban). Quit thinking about the quality of the encyclopaedia for a bit and focus on your peace of mind.',
b'Case settled? \n\nThere is currently a footnote saying the case was settled out of court on July 29, 2008 for an undisclosed sum, and that there is a gag order forbidding disclosure of how much the settlement is. However, the link given as reference is broken, and I was unable to find any other reference on the web. Could it really be that NO ONE wrote about the settlement in their blogs, when so much was said earlier about the lawsuit?',
b"Awww, my deepest appologies, here's a Kleenex.",
b'"\n\nThe only other option to ""evisceration"" might be to take the entire thing back to userspace until it is completed. Would that be better for you, D? Somehow I doubt it.\nBTW, you will be pleased to hear that I have been developing one of your stubs. Not much because of source limitations, but a bit nonetheless - Green Leaves. - "',
b"=Am I disruptive ?\nSteve, the criticism of Calo's paper is addressed in the two articles by Cloonan and Trell that are also in the Wikipedia article about Ruggero Santilli. These two articles were there, you ignored them and quickly eliminated the entire section. There was also another article about ongoing experiments that you ignored and cancelled...but you erased everything so nobody knows....in addition you and i are not qualified to take on pros or cons, just to report it and you avoided and prohibited that. I do not know where the idea of law suit came from. You are the one who got the three legal points and said that you were saying everything you wanted since no legal suit could be won against you . What a fine example of editorial ethics, I already told what I think about your position, but you ignored it. If you do not want to add the articles, as are in the Wikipedia page itself , I could not care less, it is the loss of the readers who do not read about ongoing research with its pros and cons. I thought it was interesting since I am sure there are more and that shows a vital field of research. However I take issue with associating Santilli, with fraud, kook, fringe and I am sure that it is against the rules of Wikipedia on libel and reckless behavior with the clear intent to damage since the fringe scientist definition is not supported by anything of value, except, perhaps some blogs and tabloids which, according to the rules, have no source value since they are not peer-reviewed articles! Unless you change the rules again. Scuranova",
b'Show up again when you figure out how to rub two brain cells together, and after you lose that filthy ego of yours!71.174.141.4',
b'While I believe the situation could have been handled better by everyone, there is nothing to be gained by further discord on this subject. You have been given an adequate chance to properly cite your additions; I suggest you either do so, or not, and move on. I have closed the discussion on my talk page and request that no further comment be made there. Thank you.',
b'"\nthere is no electron\nFacts:\n\n1) Faradays law of induction cannot be explained by electron theory Source: physics Textbook \n2) Maxwells electromatetic theory is in direct violation of electron theory. Most notably from his theory displacement current is still taught as mainstream science. Source physics Textbook \n3) In chemisty the number of electrons leaving a mass is determined by the voltage Source: chemistry textbook \n4) In induction physics, the current is determined by the number of coils in the winding source: physics textbooks \n5) In circiut theory the current is determined by the load Source: also physics textbook \n6) these three things are all different \n7) In the power industry, there is also something called ""current draw"" which is a current not determined by the load but the power supply Source: Con Edison training manual \n8) All of these things with the exception of 7 are excepted theories of science that contflict with the electron. Number 4 and 5 conflict with number 3 which is part of the definition of the electron. \nKnown people in the scientific community who said that they did not hold electrons to be true without proof(or admitted there existance to be different than that of mainstream science)\n\n1) Albert Einstien said that their existance was different \n2) Max Planck said that he would not hold it to be true without proof \n3) Faraday was totally against the idea before it ever came out \n4) Tesla in his patents refers to electricity as a pressure rather than a substance \n5) Heaviside said that electrons were compressed ether \n6) JJ Thomson the so called discover of the electron said that he did not agree with the electron theory that was interpreted from his experiment. \nAnyone else that a circuits textbook fails to mentions. Actually I find that most circuits textbooks don\'t even talk at all about any scientist. 
What I demand:\n\n1) A scientific presentation of electron theory that uses the scientific method \n2) A list of people in the scientific community who actually accepted this (and hold electrons to be a fact and more than a theory), and on what basis of the scientific method did they accept it."',
b"This is not your average not-many-people-know-about biography or that like article. It shouldn't be to much to ask that you allow a few days, or even a few hours, for everyone who has this article on his watchlist to comment or, in my case, to prepare a more elaborate comment. It certainly is the only way to avoid an edit war. Now I don't mind a full protection of this article due to another edit war - actually I think it would help since we do have a lot of argument to catch up on.",
b'Gov. Scott \n\nI recall shaking hands with the Governor in 1952 at the age of 5. I looked him in the eye and he looked me in the eye but I do not recall what he said. However, Concern about GMO and nutrition continues to be a major part of my life as well as technologies which can help anyone get to the bottom of an issue and completely understand it. I am also deeply committed to equality among races perhaps because my own race has the benefit of 4,000 years of cultural evolution which is not always shared but withheld from those who might improve it. I am truly a better person and fortunate that my uncle introduced me to Gov. Scott. ~',
b'Response - Tephrachronology is a precise and well established procedure. The Laarcher See Volcano is dated to within 40 years, 200 years before the Younger Dryas, and for you to imply that anyone in the geological community subscribes to your personal hypothesis that the Laarcher See volcanic eruption was related in any way to the Younger Dryas Chronozone - is simply false. I have no more plans to attempt to either educate you or edit your ramblings here.',
b"lonely\n\nGo outside and make a friend. You've seen all your mom's basement has to offer.",
b'fuck u little bitch. im the porn king',
b'Mr.Kumar rao! Thank you for your support about Rajus article,i have seen your message in my talk page.All the Best. -abcde',
b'Your messages. \n\nI am sorry but what are you talking about?',
b"By the way, if you don't think MacDonald qualifies as an anthropologist, you might want to examine what sort of Ph. D.s and professorships anthropologists usually have or have had. Here is MacDonald's quite impressive curriculum vitae: http://www.csulb.edu/~kmacd/VITA2005.pdf",
b'Visit http://www.WebNoeSys.com\nMail us at support@WebNoeSys.com, contact@WebNoeSys.com',
b'"\n\n Powers and Abilities \n\nMy addition to the Hulk page was removed by a chap named Cameron Scott, but so far he has failed to explain exactly why he did so. My addition to the P+A section simply stated that Hulk\'s one ""vulnerability"" is that he can be made weaker when calmed down. This has happened in various comics and so is accurate information. Any thoughts? \xe2\x80\x94Preceding unsigned comment added by (talk \xe2\x80\xa2 contribs) \n\nIt\'s been removed for a few reasons. One, you\'re violating WP:SOCK, using multiple accounts to try to make it looks like there\'s more support for your edit than there is. Two, Using one instance of an idea, citing a comic book for it, is not enough. Hulk has been stopped in other ways - Magic, Drugs, being knocked out cold. That his opponents have used the tactic of defusing his anger to allow other methods of defeat to work is also different than that alone being the defeat itself. "',
b"Deletion discussion about Afif Chaya \nHello, Ammar Alwaeli, \n\nI wanted to let you know that there's a discussion about whether Afif Chaya should be deleted. Your comments are welcome at Wikipedia:Articles for deletion/Afif Chaya . \n\nIf you're new to the process, articles for deletion is a group discussion (not a vote!) that usually lasts seven days. If you need it, there is a guide on how to contribute. Last but not least, you are highly encouraged to continue improving the article; just be sure not to remove the tag about the deletion nomination from the top. \n\nThanks, \xe2\x80\x94 \xc2\xbb\xc2\xbb (talk to me)",
b'"\n\nThis is better, but it does talk about ""simultaneous values"". "',
b'yeah \n\nbut scientology is still a fucking joke. 70.92.103.13',
b"your atricle demonstrates absolutely nothing. first, it nowhere says Fangio was the greatest. Second, it rests its arguemnt on the grounds that he won 5 C in 8 years, which is supposedly better than Shumacher's. But wait it's not!. Schumacher won 5 WC in FIVE YEARS. Gone.",
b'" May 2009 (UTC)\n\nDo you mean you can read about those work conditions & not see that it\'s slavery? Prepare yourself for earthshaking news: the wage system is slavery but most people didn\'t think so. Slavery means ""someone who is controlled by someone or something."" A boss controls a slave, but a slave also enslaves a boss. Being offered a wage you can\'t live on is slavery. All jobs are being eliminated by machine-slaves, which makes it more obvious than ever before that all people must have a Guaranteed Minimum Income, although I prefer the word \'residual\' to \'minimum\'. Read child labour & try to see it was child slavery into the 20th century, & it still is today since many children work in their parents\' business. When they ""ended child labour"" that didn\'t set the children free; they were ""free to starve"" because what they need/ed is a GMI. Having children is a form of slavery too: parents become slaves to their children (when they need to eat, bathe, etc) but the children are also slaves to their parents. So ""no control"" also means we\'ll always be slaves to needing food, & we can find ways to eliminate all the work with machines which again means all people need a GMI. (Actually all people should own all things, because when a few rich people own everything it\'s obviously slavery; how did every person not know that? America refused to let any attempts at Socialism & Communism succeed; instead they forced it to fail by starting many wars, teaching torture, & you must read Wm Blum\'s ""Rogue State"". Fidel Castro was right & America was wrong.) And the BLS is wrong; they always underestimate, because most of those people are probably illegal immigrants from many generations back. 
They say every year 50,000 people are smuggled into the US to work as prostitutes (that\'s slavery), so can you imagine how many illegals there must be in America hidden behind walls, & all supporting a few rich slave masters who travel the world & raise their families on the backs of SLAVES in America. How can you read about multinational corporations & not see they cause poverty in those other countries, as well as in America? If Liberia has a population of 3 million & one corporation goes there & ""creates only 2000 jobs"" then can\'t anyone see the wage is causing their poverty?? That\'s just an example, but do you see what I\'m saying about the wage is the problem worldwide, not the solution, like America thinks? When 3,000 people show up to apply for 20 job openings, you know the wage is wrong & it\'s slavery. I know I\'m a slave, & I know every person is a slave & I\'m right & everyone is wrong. Gerry Spence also says we\'re slaves in Give Me Liberty. That\'s all for now. 02:08, 31"',
b"It's not about some obscure song. Rather, it's about a very famous song that purportedly was written about her. That's more than trivia. It's a part of her life that she has talked about on many occasions.",
b'possible racism of user rodhullandemu - as this is my talk page, i can of course decide the headings. this is not my last warning. i do like to eat pineapples.'],
dtype=object), array([[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0],
[1, 1, 1, 0, 1, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[1, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0]]))
We also check the shapes: we expect features of shape (batch,) and targets of shape (batch, number of labels).
text train shape: (32,)
text train type: object
label train shape: (32, 6)
label train type: int64
Of course, preprocessing! Text is not the type of input a NN can handle. The TextVectorization layer is meant to handle natural language inputs. The processing of each example contains the following steps:
1. Standardize each example (usually lowercasing + punctuation stripping).
2. Split each example into substrings (usually words).
3. Recombine substrings into tokens (usually ngrams).
4. Index tokens (associate a unique int value with each token).
5. Transform each example using this index, either into a vector of ints or a dense float vector.
For more reference, see the documentation at the following link.
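To build intuition, the five steps can be mimicked in plain Python. This is a toy sketch of the idea, not what the Keras layer does internally; the helper names are mine:

```python
import re
from collections import Counter

def standardize(text):
    # step 1: lowercase + strip punctuation
    return re.sub(r"[^\w\s]", "", text.lower())

def vectorize(corpus, sentence, output_sequence_length=8):
    # steps 2-3: split each example into word tokens
    tokenized = [standardize(t).split() for t in corpus]
    # step 4: index tokens by frequency; 0 = padding/mask, 1 = out-of-vocabulary
    counts = Counter(tok for toks in tokenized for tok in toks)
    vocab = {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common())}
    # step 5: map the example to integer ids, pad/truncate to a fixed length
    ids = [vocab.get(tok, 1) for tok in standardize(sentence).split()]
    return (ids + [0] * output_sequence_length)[:output_sequence_length]

corpus = ["The comment is clean.", "The comment is toxic!"]
vec = vectorize(corpus, "The comment is new")
```

An unseen word like "new" maps to the out-of-vocabulary index, and the sequence is padded with zeros up to the fixed length, just as the real layer pads to output_sequence_length.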
text_vectorization = TextVectorization(
    max_tokens=config.max_tokens,
    standardize="lower_and_strip_punctuation",
    split="whitespace",
    output_mode="int",
    output_sequence_length=config.output_sequence_length,
    pad_to_max_tokens=True
)
# prepare a dataset that only yields raw text inputs (no labels)
text_train_ds = train_ds.map(lambda x, y: x)
# adapt the text vectorization layer to the text data to index the dataset vocabulary
text_vectorization.adapt(text_train_ds)

This layer is configured as follows:
- max_tokens: 20000, a common choice for text classification. It is the maximum size of the vocabulary for this layer.
- output_sequence_length: 911. See Figure 3 for how this value was chosen. Only valid in "int" mode.
- output_mode: "int", one integer index per split string token. When output_mode == "int", 0 is reserved for masked locations; this reduces the effective vocabulary size to max_tokens - 2 instead of max_tokens - 1.
- standardize: "lower_and_strip_punctuation".
- split: on whitespace.
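To see the indexing and transformation steps concretely, here is a toy layer adapted on a two-sentence corpus. The parameter values are illustrative only, not the configuration used above.

```python
from keras.layers import TextVectorization

# Toy example: max_tokens and output_sequence_length are
# illustrative, not the values used for the real model.
toy_vec = TextVectorization(
    max_tokens=10,
    output_mode="int",
    output_sequence_length=5,
)
toy_vec.adapt(["the cat sat", "the dog ran"])

# "the" is the most frequent token, so it gets the lowest
# non-reserved index: 0 is padding, 1 is the OOV token.
out = toy_vec(["the cat"]).numpy()
print(out.shape)  # (1, 5): one example, padded to length 5
```

Shorter sequences are right-padded with the reserved index 0, which is why `mask_zero=True` is used later in the Embedding layer.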
To preserve the original comments as text while also obtaining a tf.data.Dataset in which the text is preprocessed, we map the TextVectorization layer over the features of each dataset.
processed_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=tf.data.experimental.AUTOTUNE
)
processed_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=tf.data.experimental.AUTOTUNE
)
processed_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=tf.data.experimental.AUTOTUNE
)

Define the model using the Functional API.
def get_deeper_lstm_model():
    clear_session()
    inputs = Input(shape=(None,), dtype=tf.int64, name="inputs")
    embedding = Embedding(
        input_dim=config.max_tokens,
        output_dim=config.embedding_dim,
        mask_zero=True,
        name="embedding"
    )(inputs)
    x = Bidirectional(LSTM(256, return_sequences=True, name="bilstm_1"))(embedding)
    x = Bidirectional(LSTM(128, return_sequences=True, name="bilstm_2"))(x)
    # global average pooling over the sequence dimension
    x = GlobalAveragePooling1D()(x)
    # add regularization
    x = Dropout(0.3)(x)
    x = Dense(64, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.01))(x)
    # LayerNormalization is not among the imports above, so use the full path
    x = tf.keras.layers.LayerNormalization()(x)
    outputs = Dense(len(config.labels), activation='sigmoid', name="outputs")(x)
    model = Model(inputs, outputs)
    model.compile(optimizer='adam', loss="binary_crossentropy", metrics=config.metrics, steps_per_execution=32)
    return model
lstm_model = get_deeper_lstm_model()
lstm_model.summary()

Finally, the model has been trained using two callbacks:
- EarlyStopping, to avoid consuming the Kaggle GPU quota.
- ModelCheckpoint, to retrieve model training information.
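The two callbacks are not shown in the snippet above; a minimal sketch of how they are typically configured. The monitored metric, patience, and file name here are assumptions, not necessarily the values used on Kaggle.

```python
from keras.callbacks import EarlyStopping, ModelCheckpoint

# Assumed configuration: monitor, patience and filepath are
# illustrative, not necessarily the settings used in training.
callbacks = [
    EarlyStopping(
        monitor="val_loss",
        patience=3,
        restore_best_weights=True,  # roll back to the best epoch
    ),
    ModelCheckpoint(
        filepath="lstm_model.keras",
        monitor="val_loss",
        save_best_only=True,  # keep only the best checkpoint on disk
    ),
]
```

The list is then passed to `model.fit(..., callbacks=callbacks)`.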
Because the dataset is imbalanced, we calculate class weights to improve performance. These will be passed to the model during training.
class_weight
toxic 0.095900590
severe_toxic 0.009928468
obscene 0.052757858
threat 0.003061800
insult 0.049132042
identity_hate 0.008710911
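The values in the table match the proportion of positive examples per label; the exact derivation used is an assumption on my part. A sketch on a toy label matrix:

```python
import numpy as np

# Hypothetical toy label matrix: rows are comments,
# columns the six labels (toxic, ..., identity_hate).
labels = np.array([
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
])

# Proportion of positive examples per label: rarer labels get
# smaller values, mirroring the table above.
freq = labels.mean(axis=0)

# Keras's class_weight argument expects a dict mapping the
# output index to its weight.
class_weight = {i: w for i, w in enumerate(freq)}
```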
It is also useful to define the steps per epoch for the train and validation datasets. This is required because, with .repeat(), the datasets become infinite and training would otherwise never finish an epoch.
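A sketch of the computation: one step per full batch, via integer division. The dataset sizes and batch size below are assumptions, standing in for the real split sizes.

```python
# Hypothetical sizes: the real values come from the
# train/validation splits built earlier.
n_train, n_val, batch_size = 127656, 15957, 32

# With .repeat() the dataset is infinite, so we must tell fit()
# how many batches make up one epoch.
steps_per_epoch = n_train // batch_size
validation_steps = n_val // batch_size
print(steps_per_epoch, validation_steps)
```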
The fit has been done on Kaggle to leverage the GPU. Some considerations about the model:
- .repeat() ensures the model sees the whole dataset.
- epochs is set to 100.
- validation_data uses the same repeat.
- callbacks are the ones defined before.
- class_weight ensures the model is trained taking the frequency of each class into account, because our dataset is imbalanced.
- steps_per_epoch and validation_steps depend on the use of repeat.

Now we can import the model and the history trained on Kaggle.
# A tibble: 5 × 2
metric value
<chr> <dbl>
1 loss 0.0542
2 precision 0.789
3 recall 0.671
4 auc 0.957
5 f1_score 0.0293
For prediction, the dataset does not need to be repeated: the model is already trained and simply consumes the data once to produce predictions.
The best way to assess the performance of a multi-label classifier is a confusion matrix. Scikit-learn provides a specific function that computes one binary confusion matrix per label, to handle the fact that a single prediction can carry multiple labels.
Grid Search CV is a technique for fine-tuning the hyperparameters of an ML model: it systematically searches through a set of hyperparameter values to find the combination that leads to the best model performance. Here I combine it with KFold cross-validation, a resampling technique that splits the data into k consecutive folds; each fold is used once as the validation set while the remaining k - 1 folds form the training set. See the documentation for more information.
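A minimal sketch of the KFold split on toy data; the fold count and array are illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)  # toy indices standing in for the comments

# Each of the 5 folds serves once as validation (1/5 of the data)
# while the remaining 4/5 form the training set.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}")
```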
The model is trained to optimize recall. This decision was made because the cost of missing a True Positive is greater than that of a False Positive: missing an injurious comment is worse than classifying a clean one as injurious.
Having said this, I still want to test metrics other than recall_score, to have more options when choosing the best threshold.
Optimal threshold (f1): 0.15. Best score: 0.4788653.
Optimal threshold (recall): 0.05. Best score: 0.8095814.
Optimal threshold (ROC AUC): 0.05. Best score: 0.88095.
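The threshold search can be sketched as a simple grid over candidate cut-offs, scoring each one and keeping the best. The data here are toy values and the scorer is micro-averaged f1_score as an example.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Toy multi-label ground truth and predicted probabilities:
# positives get probabilities in [0.6, 1], negatives in [0, 0.5).
ytrue = rng.integers(0, 2, size=(100, 6))
probs = np.clip(ytrue * 0.6 + rng.random((100, 6)) * 0.5, 0, 1)

best_t, best_score = 0.0, -1.0
for t in np.arange(0.05, 0.95, 0.05):
    # binarize at the candidate threshold and score
    score = f1_score(ytrue, (probs >= t).astype(int), average="micro")
    if score > best_score:
        best_t, best_score = t, score
print(best_t, best_score)
```

The same loop works with recall_score or roc_auc_score as the scorer, which is how the three thresholds above differ.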
The confusion matrices are plotted using the multilabel_confusion_matrix function from scikit-learn, one per label. To plot them, we first convert the predicted probability of each label into a hard prediction, using the optimal threshold calculated for recall, which is 0.05. Given the multi-label task, the result is not one big matrix with all the labels as rows and columns; instead, we plot one confusion matrix per label with a simple for loop that extracts, at each iteration, a confusion matrix and its associated label.
# convert probability predictions to predictions
ypred = predictions >= optimal_threshold_recall # .05
ypred = ypred.astype(int)
# create a plot with 3 by 2 subplots
fig, axes = plt.subplots(3, 2, figsize=(15, 15))
axes = axes.flatten()
mcm = multilabel_confusion_matrix(ytrue, ypred)
# plot the confusion matrices for each label
for i, (cm, label) in enumerate(zip(mcm, config.labels)):
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot(ax=axes[i], colorbar=False)
    axes[i].set_title(f"Confusion matrix for label: {label}")
plt.tight_layout()
plt.show()
# A tibble: 10 × 5
metrics precision recall `f1-score` support
<chr> <dbl> <dbl> <dbl> <dbl>
1 toxic 0.552 0.890 0.682 2262
2 severe_toxic 0.236 0.917 0.375 240
3 obscene 0.550 0.936 0.692 1263
4 threat 0.0366 0.493 0.0681 69
5 insult 0.471 0.915 0.622 1170
6 identity_hate 0.116 0.720 0.200 207
7 micro avg 0.416 0.896 0.569 5211
8 macro avg 0.327 0.812 0.440 5211
9 weighted avg 0.495 0.896 0.629 5211
10 samples avg 0.0502 0.0848 0.0597 5211
The BiLSTM model optimized for high recall performs well enough to make predictions for each label, except for threat. See Table 2 and Figure 1: the threat label accounts for only 0.27% of the observations. The model has been optimized for recall because the cost of not identifying an injurious comment is higher than the cost of flagging a clean comment as injurious.
Possible improvements could be to increase the number of observations, especially for the threat label. In general there are too many clean comments; this could be addressed by undersampling them, which I explicitly avoided in order to check the performance of the BiLSTM on an imbalanced dataset, leveraging the class weight method instead.